Wavelet Document Image Compression

Field Of the Invention

The present invention relates to a novel image compression technique for classing,
matching and identifying document images based on a wavelet compression method.
This technique is called the wavelet document image compression (WDIC) technique.
More specifically, the WDIC technique separates the characters/lines and pictures
from the background of an original document image and uses different techniques
to compress each of those components. More generally, this technique may also be
applied to other special documents such as particularly important historical
documents, scientific papers with mathematical or chemical formulae, software
documents and some handwritten signatures.

Background Of the Invention 

As electronic storage, retrieval and distribution of documents becomes faster and
cheaper, more and more documents are becoming available digitally. In the last decade
existing documents have usually been re-typed and converted to HTML or Adobe's PDF
format, sometimes with the help of Optical Character Recognition (OCR) techniques.
Unfortunately, these techniques are still far from being able to translate a
scanned document faithfully into a web page; much of the visual aspect of the original
document is likely to be lost. Recently, several authors [M] have proposed image-based
approaches to digital documents. The "image-based approach" to digital documents is
to store and to transmit documents as images. Traditional image compression standards
such as JPEG and GIF are inappropriate for document images. Although they are
suitable for continuous-tone images (i.e. for most pictures of natural scenes), they are
not suited to the sharp edges of character images. On the other hand, a scanned document
tends to be quite large if one wants to preserve the readability of the text. An approach
to compressing document images is therefore needed that makes it possible to
transfer a high-quality one-page document image at a very high compression ratio.
The WDIC document image compression technique described here is designed to
overcome all the above problems.

Objective Of the Invention 

The object of the invention is to provide a novel image compression technique
(WDIC) for classing, matching and identifying document images. A more specific
object is to provide a wavelet-based compression algorithm for picture images, and a
novel extent-based morphological matching, clustering and wavelet compression
algorithm for mostly small character/line images.

Summary Of the Invention 

The invention comprises a number of novel algorithms for an improved document
image compression technique. The main idea of our document image compression
technique is to extract two main categories of areas, picture areas and character/line
areas, from the document image and to encode the residue image obtained by subtracting
these two categories of areas from the document image.

The character image can be encoded with a novel extent-based morphological
matching, clustering and wavelet compression algorithm. A picture image can be
encoded with a wavelet-based compression algorithm, which is suitable for grey scale
images.



WDIC is a progressive code. It provides progressive decoding not only on the
background, but also on character images.

Detailed Description Of the Invention

In the following sections the novel techniques of the WDIC are described. The
features of WDIC comprise special image segmentation for a document image, fast
classification, a morphological matching and clustering algorithm for character/line
images, and a wavelet-based compression algorithm for picture images. Results from an
actual WDIC system showed that the novel means contribute significant
performance improvement in two aspects: a highly efficient compression format and
a progressive range of compression rate scalability.

Encoder 10 

Figure 1 is the block diagram of the encoder 10. We start our process from a scanned
greyscale document image 101 with scanned resolution r dpi. Process 102 extracts
picture image blocks 103 from 101. 103 will be encoded by the wavelet based SAQ
encoder 104 ([5],[6]). Encoder 104 passes the compressed bit stream 105 to the
process 118.

By subtracting the picture images 103 from 101, we obtain the residue image 106, which
is further processed by 107. Process 107 generates the connective blocks from the
residue image 106.



At first, we posterize the image into 3 levels as below,

$$F(v) = \begin{cases} 0 & \text{when } I(v) > P \\ 1 & \text{when } m \le I(v) \le P \\ 2 & \text{when } I(v) < m \end{cases}$$

where I(v) is the intensity of the pixel v.
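The three-level posterization can be sketched as follows. This is a minimal illustration, not the patented implementation: the image is assumed to be a list of rows of grey levels, `p` stands for the threshold P, and `m` is treated as a hypothetical lower threshold because the text does not say how it is chosen. Treating both boundaries of the middle level as inclusive is also an assumption.

```python
def posterize(image, p, m):
    """Map each grey level to 0 (above p), 1 (between m and p) or 2 (below m)."""
    if m > p:
        raise ValueError("m must not exceed p")
    levels = []
    for row in image:
        out_row = []
        for intensity in row:
            if intensity > p:
                out_row.append(0)      # bright pixel, F = 0
            elif intensity >= m:
                out_row.append(1)      # intermediate pixel, F = 1 (assumed inclusive)
            else:
                out_row.append(2)      # dark pixel, F = 2
        levels.append(out_row)
    return levels
```

For example, `posterize([[250, 150, 20]], p=200, m=100)` yields one row with one pixel at each of the three levels.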

The following algorithm is performed at every untraced pixel u with
F(u) = 1.

1. 5-^,^1 =(i/).>F=^xa 

C is slightly larger than the font size of most characters.

2. Find VbS^, {vJJ,, represent eight neighbor pixels of vin 
clodiwise order, among them {Vy,v^,v^,Vy) are 4-neighbor pixels 
ofv. Define VM^=Vf,keZ. ^ = 5'UM.iS, 



J. for v„/=;,".,«, |v^,-u,|s^.«m/|»^-.|tJ:SI7 
a. ifi'^aSJ. 

i. tfF(v)'2and(F(v,)=2or(F(v,)=latidF(v,.,)+F(v,^,)kI)) 
rteBSj=5,U{K,J 

II. ifF(v)^l<md(F(v,)=2cmd(F(v^)-¥F(y^,,)-i2)) 

if F(v)^2an(l(F(v,):^2 aadF(v,,,)-^F(v^j)i:l) 

4. ifS, *^,gpUt5t^2 

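5. $A = \{(x, y) \mid x_{min} \le x \le x_{max},\ y_{min} \le y \le y_{max}\}$ is the character image block,
where $x_{min} = \min\{v_i.x\}$, $x_{max} = \max\{v_i.x\}$, $y_{min} = \min\{v_i.y\}$, $y_{max} = \max\{v_i.y\}$.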

After character image block A is extracted and saved into the
character block list, we mark the pixels in this block as traced
pixels and change their value to 255. The same procedure then starts from
the remaining untraced pixels satisfying F(u) = 2 until no such pixel exists.
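The overall trace-and-box loop can be illustrated with the sketch below. It deliberately swaps in plain 8-connectivity for the special connectivity rule of the algorithm above, so it is a simplification rather than the method itself. Starting regions only at level-2 pixels, growing them through pixels of level 1 or 2, and using a separate `traced` mask instead of overwriting pixel values with 255 are all assumptions made for the example.

```python
from collections import deque

def extract_blocks(levels):
    """Return bounding boxes (x_min, y_min, x_max, y_max) of grown regions."""
    height = len(levels)
    width = len(levels[0]) if height else 0
    traced = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if levels[y0][x0] != 2 or traced[y0][x0]:
                continue
            # Grow a region from this untraced dark seed pixel.
            queue = deque([(x0, y0)])
            traced[y0][x0] = True
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                x, y = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and not traced[ny][nx]
                                and levels[ny][nx] >= 1):
                            traced[ny][nx] = True
                            queue.append((nx, ny))
            # Mark every pixel of the bounding box as traced, as done for block A.
            for y in range(y_min, y_max + 1):
                for x in range(x_min, x_max + 1):
                    traced[y][x] = True
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes
```

Each returned tuple plays the role of one character image block A; the input is a posterized image with values 0, 1 and 2.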

Character images 108 are the blocks representing the lines and characters extracted by
107 from 106. Process 109 clusters the character images hierarchically. We will
elaborate the process 109 in 401 to 413. Process 109 outputs data 110 comprising the
character template library and the code of every character block outputted from
109. The code of a character block includes the absolute coordinates of the block in the
original image and the index of the template it uses. Character encoder 111 encodes
the data 110.

The output 112 is the compressed bit stream for the characters. While the data 112 is
passed to the process 118, it is also decoded by the decoder 113, which is the
counterpart of SAQ encoder 111. The reconstructed character images 114 are used to
get the background image 115. Process 116 is a wavelet based SAQ encoder
for grey scale images ([5], [6]). The compressed bit stream 117 for the background image
is passed to the process 118. The process 118 organizes the compressed bit streams of
picture image blocks, character image blocks and the background image to generate
the compressed data 119 for the whole document image.



Data 119 is organized as follows. We save the document image header and
[illegible] image blocks first; then the compressed bit planes corresponding to
values greater than H of the character template library, picture image blocks and
background image are stored; finally, the residue bit plane information is added
one bit plane after another, from the most significant bit plane to the least
significant bit plane. Such organization guarantees the progressive decoding of the
document image. In other words, we can obtain the document image from a blurred
version to the finest one.
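The effect of storing bit planes from most significant to least significant can be illustrated with the small sketch below. It shows only the ordering idea on plain non-negative integers; the actual WDIC stream layout, the SAQ coding and the threshold H are not modelled, and the function names are hypothetical.

```python
def bit_planes(values, bits=8):
    """Split non-negative integers into bit planes, most significant first."""
    return [[(v >> b) & 1 for v in values] for b in range(bits - 1, -1, -1)]

def progressive_reconstruct(planes, received, bits=8):
    """Rebuild values from the first `received` planes; missing planes count as 0."""
    values = [0] * len(planes[0])
    for index in range(received):
        shift = bits - 1 - index
        for i, bit in enumerate(planes[index]):
            values[i] |= bit << shift
    return values
```

Decoding only the first few planes gives a coarse approximation of every value, and each further plane refines it, which is the behaviour the text describes as going from a blurred version to the finest one.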

Picture image block extractor 102 

Picture image block extractor 102 is elaborated in the following. 

Process 301 estimates the peak value $P_0$ of the histogram of the document image and
sets the threshold $P = (128 + P_0)/2$. Pixels whose intensity is not greater than P
are classified as foreground pixels; other pixels are background pixels.

Process 302 partitions the entire document image into blocks of size $W \times W$, where
$W = 2^{\lfloor \log_2 (r/4) \rfloor}$ and r is the scanned resolution.
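The partition step can be sketched as follows, assuming the block size reads W = 2^floor(log2(r/4)); the formula is damaged in the source, so that reading is itself an assumption. Clipping the blocks at the right and bottom image borders is a further assumption, since the text does not say how partial blocks are handled.

```python
import math

def block_size(resolution_dpi):
    """Block side W = 2 ** floor(log2(r / 4)), as the formula is read here."""
    if resolution_dpi < 4:
        raise ValueError("resolution must be at least 4 dpi")
    return 2 ** math.floor(math.log2(resolution_dpi / 4))

def partition(width, height, w):
    """Return (x, y, block_width, block_height) tuples covering the image."""
    blocks = []
    for y in range(0, height, w):
        for x in range(0, width, w):
            # Blocks on the right and bottom edges are clipped to the image.
            blocks.append((x, y, min(w, width - x), min(w, height - y)))
    return blocks
```

Under this reading a 300 dpi scan uses 64 x 64 blocks and a 72 dpi scan uses 16 x 16 blocks.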

Process 303 classifies blocks into two types: picture blocks, marked 1, and non-picture
blocks, marked 0. The verdict is based on the statistical features of the wavelet
decomposition of the blocks. The procedure is as follows.

Use the wavelet filter to decompose the block once, as in the conventional wavelet
decomposition of an image. For computational efficiency, the sum of the filter coefficients
is 2 and the suggested filter for this procedure is the Haar wavelet. The figure below
shows this procedure. LL, LH, HL and HH are the notations for the lowest frequency
component to the highest frequency component as usual ([7]).



[Figure: one-level wavelet decomposition of a block into the LL, LH, HL and HH subbands]
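A one-level Haar-style decomposition with unnormalised filters can be sketched as below. The low-pass filter [1, 1] has coefficient sum 2, which matches the remark about computational efficiency; the high-pass filter [1, -1], the even block size requirement and the naming of the LH and HL subbands (conventions differ between authors) are assumptions of this example.

```python
def haar_decompose(block):
    """Split an even-sized block into LL, LH, HL and HH using integer arithmetic."""
    rows, cols = len(block), len(block[0])
    if rows % 2 or cols % 2:
        raise ValueError("block dimensions must be even")
    ll, lh, hl, hh = [], [], [], []
    for y in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for x in range(0, cols, 2):
            a, b = block[y][x], block[y][x + 1]
            c, d = block[y + 1][x], block[y + 1][x + 1]
            ll_row.append(a + b + c + d)   # low-pass in both directions
            lh_row.append(a + b - c - d)   # difference between the two rows
            hl_row.append(a - b + c - d)   # difference between the two columns
            hh_row.append(a - b - c + d)   # diagonal difference
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh
```

With these filters each LL coefficient is the sum of four pixels, so it is four times the local mean intensity.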



In general, a document image is typically composed of a large portion of character
and edge regions, together with a rather small portion of homogeneous regions.
Homogeneous regions have the least variation. Character regions have moderate
variation, and lines show the most variation.

$$g(c) = \begin{cases} 1 & \text{when } |c| > A \\ 0 & \text{otherwise} \end{cases}$$

where A is a predefined threshold (default 16) and c is a wavelet coefficient.

Calculate the sums of wavelet coefficients. The statistical variables used in the
classification are the following:

$$count_H = \sum_{c \in H} g(c), \quad \text{where } H = HL \cup LH \cup HH,$$

$$average_{LL} = \frac{\sum_{c \in LL} c}{N_{LL}}, \quad \text{where } N_{LL} \text{ is the number of wavelet coefficients of } LL.$$

If $count_H < B$ and $average_{LL} < (P + 128)/2$, where B is the predetermined threshold
whose default value is 3, the block is marked as a picture block; otherwise it is marked
as a non-picture block.
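The block test can be sketched as follows. The formulas in this passage are badly damaged in the source, so the sketch rests on several assumptions: count_H is taken as an unnormalised count of high-frequency coefficients whose magnitude exceeds the threshold (default 16), average_LL is the mean of the LL coefficients, and the hypothetical parameter `ll_scale` converts LL coefficients back to the intensity scale (4.0 for an unnormalised Haar filter whose LL value is a sum of four pixels).

```python
def classify_block(ll, lh, hl, hh, p, a=16, b=3, ll_scale=4.0):
    """Return 1 for a picture block and 0 for a non-picture block."""
    # Number of high-frequency coefficients with magnitude above the threshold.
    count_h = sum(
        1 for band in (lh, hl, hh) for row in band for c in row if abs(c) > a
    )
    # Mean of the low-frequency subband, rescaled to pixel intensities.
    n_ll = sum(len(row) for row in ll)
    average_ll = sum(c for row in ll for c in row) / (n_ll * ll_scale)
    is_picture = count_h < b and average_ll < (p + 128) / 2
    return 1 if is_picture else 0
```

Under these assumptions a smooth, fairly dark block is marked as picture, while a bright flat block (background) or a block with many strong edges (text) is not.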
Switch 304 checks whether an untraced picture block exists. If the answer is NO, all
picture blocks have already been saved in data 316 and process 102 finishes.

Otherwise, the next untraced picture block is identified in step 305 and its
mark is changed to zero, and the picture area is initialised to the minimum rectangle
containing the current block in step 306.

The process 317 extracts the rectangular area of a picture image and consists of two
steps. Firstly, process 318 extracts the picture blocks. Then this area will further grow
to its neighbour pixels in process 319 if necessary.

Switch 307 checks whether there is a neighbour block of the current picture area whose
mark is 1. If the answer is YES, mark this block 0 in 308, extend the picture area to a
new rectangular area containing this block in process 309, and go back to switch 307. If
the answer is NO, none of the neighbour blocks is a picture block. We finish process 318
and go to switch 310.

Switch 310 checks whether the rectangular picture area is big enough by comparing
its length and width to the preset value (default 2W). If the answer is NO, no
picture area has been found and we turn to switch 304. Otherwise, the answer is YES and
we store the location information of the picture area in 311.

Process 319, comprising the following steps, refines the picture area. Switch 312 checks
whether there is a fore-pixel among the neighbour pixels of the current area. If the answer
is YES, process 313 extends the picture area to the new rectangular picture area
containing the found fore-pixel and we go back to switch 312. If the answer is NO, all
neighbour pixels of the current picture area are back-pixels. Process 319 finishes and
this rectangular picture is saved as a picture image area in process 314. Process 315
appends this picture image area to the list of picture images, then we go back to switch 304.
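The block-level growth of process 318 and the size check of switch 310 can be sketched as below. The sketch works on a grid of block marks rather than on pixels, treats any marked block touching the current rectangle (including diagonally) as a neighbour, and expresses the default minimum size 2W as two blocks per side; these three points are assumptions. The pixel-level refinement of process 319 is not included.

```python
def grow_picture_area(marks, row, col, min_blocks=2):
    """Grow a rectangle of blocks from (row, col); marks is changed in place.

    Returns (top, left, bottom, right) in block units, or None if the area
    is smaller than min_blocks in either direction.
    """
    rows, cols = len(marks), len(marks[0])
    marks[row][col] = 0                       # steps 305 and 306
    top, left, bottom, right = row, col, row, col
    grown = True
    while grown:                              # switch 307
        grown = False
        for r in range(max(top - 1, 0), min(bottom + 1, rows - 1) + 1):
            for c in range(max(left - 1, 0), min(right + 1, cols - 1) + 1):
                if marks[r][c] == 1:
                    marks[r][c] = 0           # step 308
                    top, bottom = min(top, r), max(bottom, r)
                    left, right = min(left, c), max(right, c)   # step 309
                    grown = True
    if bottom - top + 1 < min_blocks or right - left + 1 < min_blocks:
        return None                           # switch 310 answers NO
    return top, left, bottom, right
```

Calling the function once for every block that is still marked 1 reproduces the outer loop around switch 304.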

Process 401 generates the style of characters.

The foreground pixel set of a character block is
$\{(i, j) \mid I(i, j) < P,\ i = 0, 1, \ldots, w-1,\ j = 0, 1, \ldots, h-1\}$, where I(i, j) is the
intensity of the pixel at coordinates (i, j), w is the width and h is the height. P is
estimated in step 301.

The block distance of two pixels is defined as $d((i_1, j_1), (i_2, j_2)) = |i_1 - i_2| + |j_1 - j_2|$.

-rnin(<^((^-^'WJ))) . -minidii^-'^^h-iui.m 

The style ofthisdiaracter is (^Adfi4i^,d^,d,i^)' 
We define three sets $L_0$, $L_1$, $L_2$ for process 109. $L_0$ is the collection of character
image blocks. $L_1$ is the collection of the character code information of the character
image blocks. $L_2$ is the library of character templates, used to save the images of
character templates. Switch 402 checks whether $L_0$ is empty; if it answers YES, it
means all character blocks have been processed, then data 403 comprising $L_1$ and $L_2$
will be outputted and the process 109 terminates. Otherwise, the answer is NO in step
402, and we get the next character block T in $L_0$ in process 404. 414 is the process of
matching character block T against the templates in $L_2$. We start from the head of $L_2$.
Switch 406 checks whether all templates in $L_2$ have been matched.

If the answer is YES, it means T is a new type of character; in step 407 we append it
to $L_2$ as a new character template TL, save the code information of T in
$L_1$, and remove T from $L_0$, then go back to switch 402.

Otherwise, if the answer is NO in 406, get the character template TL from $L_2$ in process
408. We match T against TL in two steps: first match T against TL in process 409 by
their style. Switch 410 checks the result of process 409; if the answer is NO, go to
406. If the answer is YES, then match T against TL by the morphological character
matching method in 411.

411 uses a morphological approach with which the matching of two characters is fast
and accurate compared to conventional matching methods such as matching by
grey scale similarity. The new measurement based on morphology is better than the
Euclidean distance measurement and the Hausdorff measurement in noisy
environments, due to the stability of the measurement.

The new morphological operator measures the size of the difference image of two
images (one is the template and the other is the character block). Assume the two images
are f and g; the difference image f-g is defined as follows:

$$(f-g)(x, y) = \begin{cases} 1, & F(f(x, y)) > F(g(x, y)) \text{ and } |f(x, y) - g(x, y)| > C \\ 0, & \text{otherwise} \end{cases}$$

with threshold C = 31.

The difference image f-g is a binary image; in other words, it is a binary set.
Define the size of a set A with respect to a structure element B as
$e(A)_B = \sup\{a : A \circ aB \neq \emptyset\}$,
where $A \circ B$ is the normal morphological open operator.

The new measurement of the difference between two binary sets can be defined as
$S_B(f-g) = e(f-g)_B$, where B is a square structure element of size 1.
The similarity measure of two sets f, g is $M(f, g) = \max\{S_B(f-g), S_B(g-f)\}$. The
new measurement is symmetric with respect to whether the distortion is concave or
convex; however, the Hausdorff measurement is not symmetric [8].
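The size measure can be illustrated with the sketch below, under the assumption that the size of a binary set with respect to a square structure element is the largest scale at which the morphological opening is still non-empty. Opening by an a x a square is non-empty exactly when the set contains an a x a square of ones, so the size equals the side of the largest all-ones square, which a simple dynamic programme finds. The function names are hypothetical and the inputs are the two binary difference images.

```python
def square_size(binary):
    """Side of the largest all-ones square in a binary image (0 if none)."""
    best = 0
    prev = [0] * (len(binary[0]) + 1) if binary else []
    for row in binary:
        cur = [0]                              # padding column
        for x, value in enumerate(row, start=1):
            if value:
                # Largest square whose bottom-right corner is this pixel.
                side = 1 + min(prev[x], prev[x - 1], cur[x - 1])
            else:
                side = 0
            cur.append(side)
            best = max(best, side)
        prev = cur
    return best

def similarity(diff_fg, diff_gf):
    """M(f, g) as the larger of the two one-sided difference sizes."""
    return max(square_size(diff_fg), square_size(diff_gf))
```

Isolated noise pixels give a size of 1, while a missing or extra stroke fragment of 2 x 2 pixels or more gives a larger value, which is why the measure is stable under noise.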

If the measure is less than the average size of the noise region, the matching is a
success. We develop a fast algorithm based on this theory for the character matching
problem. The measure of the difference is modified as $M(f, g) = S_B(f-g)$. For the
matching of character images with resolution no less than 72, if the measure is less
than 2, the matching of the character against the template is a success. The algorithm is as
follows.


Algorithm M1

1. Suppose (f-g)(x) is a sequence with length m, x = 0,

2. if (f-g)(x) = 0, go to step 5

3. if (f-g)(x+1) = 0, (f-g)(x) = 0, go to step 5

4.

5. if (x < m-1) x = x+1, go to step 2

6. end

Algorithm M2

1. Suppose (f-g)(x) is a sequence with length m, x = 0,

2. if (f-g)(x) = 1 and (f-g)(x+1) = 1 go to step 5

4. character matches against template, go to step 6

5. character does not match against template, go to step 6

6. end
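Several steps of the two algorithms are missing from the source, so the sketch below covers only the test visible in step 2 of algorithm M2: a match fails as soon as two adjacent ones are found in the difference sequence, which agrees with the rule that a measure below 2 counts as a match. It does not reproduce algorithm M1, and the function names are hypothetical.

```python
def has_adjacent_ones(seq):
    """True if the sequence contains two consecutive ones."""
    return any(a == 1 and b == 1 for a, b in zip(seq, seq[1:]))

def matches_along_rows(diff):
    """Match succeeds if no row of the difference image has a run of two ones."""
    return not any(has_adjacent_ones(row) for row in diff)

def matches_along_columns(diff):
    """Same test applied to the columns of the difference image."""
    columns = [list(col) for col in zip(*diff)]
    return matches_along_rows(columns)
```

Scanning rows corresponds to a horizontal line structure element and scanning columns to a vertical one.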

Whether the condition is weak or strong depends on the structure element used in algorithm
M1 and the associated part of algorithm M2. A strong condition means that it
is difficult to match a character against a template. On the contrary, a weak
condition means it is very easy to match a character against a template. A strong condition
will decrease the compression ratio slightly, but a weak condition will generate false
matches and the reconstructed character may not be correct when the scanned
document image quality is very poor. The order line, circle, square corresponds
to the conditions from strong to weak.

Note: 

1. If algorithm M1 performs only along the row direction, the structure element used
in the matching algorithm is a line of horizontal direction. This element is good
enough for English character matching.

2. If algorithm M1 performs only along the column direction, the structure element used
in the matching algorithm is a line of vertical direction.

3. If algorithm M1 performs along both row and column directions, the structure
element used in the matching algorithm is a circle. The circle structure element
works well for characters of most languages.

4. If algorithm M1 performs along the row direction followed by the column direction
and then performs along the column direction followed by the row direction, the
structure element used in the matching algorithm is a square.

5. For the line structure elements we only need to apply algorithm M2 along the
same direction as M1 does. For the circle structure element, algorithm M2
performs in either the horizontal direction or the vertical direction. For the
square structure element, algorithm M2 performs in both horizontal and vertical
directions before we can conclude that the match is a success.

Switch 412 checks whether T matches against TL. If the answer is NO, go back to
switch 406. Otherwise, the answer is YES: the information of T is appended to $L_1$ and
the code of T is the index of pattern TL in $L_2$, then process 413 removes T from $L_0$,
and we go to switch 402.
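The control flow of process 109 (switches 402 to 412) can be sketched as a loop over the character blocks. The three callables `style_of`, `styles_match` and `shapes_match` are hypothetical placeholders for the style computation, the style comparison of process 409 and the morphological matching of 411; the sketch only shows how the template library and the block codes are built.

```python
def cluster_blocks(blocks, style_of, styles_match, shapes_match):
    """Cluster (position, image) blocks into templates and per-block codes."""
    templates = []   # library of templates, each stored as (image, style)
    codes = []       # one (position, template_index) entry per block
    for position, image in blocks:
        style = style_of(image)
        index = None
        for i, (t_image, t_style) in enumerate(templates):
            # Cheap style test first, expensive shape test only if it passes.
            if styles_match(style, t_style) and shapes_match(image, t_image):
                index = i
                break
        if index is None:
            # No template matched: the block becomes a new template.
            templates.append((image, style))
            index = len(templates) - 1
        codes.append((position, index))
    return templates, codes
```

The two-stage test reflects the hierarchical idea of the text: most candidate templates are rejected by the style comparison before any pixel-level matching is attempted.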
Decoder 20

Figure 2 is the WDIC decoder 20, which is the reverse process of encoder 10. Decoder 20
starts from the compressed bit stream 201 of the document image. Process 202 separates 201
into three parts based on the format of the compressed document image described in 118.
These three parts are the compressed bit stream 203 of the background image, the compressed
bit stream 206 of the character image blocks and the compressed bit stream 209 of the
picture image blocks.

Data 203 is decoded by the wavelet based SAQ decoder 204 to generate the background
image. Data 206 is decoded by character decoder 207 to generate the information of the
character codes of the character image blocks and the character template library. Data 209
is decoded by the wavelet based SAQ decoder 210 to generate the picture image blocks 211.
Data 205, 208 and 211 will be combined into document image 213 in process 212.

Claims 

What is claimed: 

1. The method of encoding of a document image comprising the steps of
extracting two main categories of picture areas and character/line areas from
the document image and,

obtaining the residue image by subtracting these two categories of areas from
the document image, and,

classifying the character/line according to the templates dynamically generated,
and

encoding the residue image by the wavelet based SAQ method.

2. The method of extracting picture areas from the document image according to
claim 1.

3. The method according to claim 2 comprising marking the blocks partitioned
from the document image based on features of their wavelet coefficients,

4. The method according to claim 2, further comprising a hierarchical picture
extracting method comprising steps of

extracting the picture blocks first to generate the initial picture area and,
refining the picture area to cover the neighbour picture pixels of the original area.

5. The method according to claim 1, further comprising the method of extracting
character/line areas from the document image,

6. The method according to claim 5 comprising a special definition of the
connectivity,

7. The method according to claim 5 further comprising the method of
extracting the character/line areas,

8. The method according to claim 1, further comprising the method of classifying
character/line,

9. The method according to claim 8 comprising generating the style of the
character/line areas,

10. The method according to claim 8 further comprising the hierarchical matching
of the character/line area against character/line templates comprising steps of
matching the styles of character/line areas against styles of templates first and,
matching of the character/line areas against the template

11. The method of matching of the character/line areas against the template
comprising morphological matching,

12. The method according to claim 11, comprising specific character
matching algorithms M1 and M2,

13. The method according to claim 11, further comprising the method of using
different structure elements for the different kinds of document image,

14. The method according to claim 1, further comprising bit plane storage of
the compressed stream of the document image in the order of character/line,
picture and background image, which can be progressively decoded by the
associated decoder,

15. The associated decoder of the encoder in claim 1.